Abstract

We investigate projection methods for evaluating a linear
approximation of the value function of a policy in a Markov
Decision Process context. We consider two popular approaches:
the one-step Temporal Difference fix-point computation (TD(0))
and the Bellman Residual (BR) minimization. We describe examples
where each method outperforms the other. We highlight a simple
relation between the objective functions they minimize, and show
that while BR enjoys a performance guarantee, TD(0) does not in
general. We then propose a unified view in terms of oblique
projections of the Bellman equation, which substantially
simplifies and extends the characterization of Schoknecht (2002)
and the recent analysis of Yu & Bertsekas (2008). Finally, we
describe some simulations suggesting that, although the TD(0)
solution is usually slightly better than the BR solution, its
inherent numerical instability makes it very bad in some cases,
and thus worse on average.



Introduction 

We consider linear approximations of the value function of a
policy in the framework of Markov Decision Processes (MDP). We
focus on two popular methods: the computation of the projected
Temporal Difference fixed point (TD(0), TD for short), which
Antos et al. (2008); Farahmand et al. (2008); Sutton et al. (2009)
have recently presented as the minimization of the mean-square
projected Bellman Equation, and the minimization of the
mean-square Bellman Residual (BR). In this article, we present
new analytical and empirical results that shed some light on both
approaches. The paper is organized as follows. Section 1 describes
the MDP linear approximation framework and the two projection
methods. Section 2 presents small MDP examples where each method
outperforms the other. Section 3 highlights a simple relation
between the quantities TD and BR optimize, and shows that while
BR enjoys a performance guarantee, TD does not in general.
Section 4 contains the main contribution of this paper: we
describe a unified view in terms of oblique projections of the
Bellman equation, which simplifies and extends the
characterization of Schoknecht (2002) and the recent analysis of
Yu & Bertsekas (2008). Finally, Section 5 presents some
simulations that address the following practical questions: which
of the methods gives the best approximation? And how useful is
our analysis for selecting it a priori?

1. Framework and Notations 

The model We consider an MDP with a fixed policy, that is, an
uncontrolled discrete-time dynamic system with instantaneous
rewards. We assume that there is a state space $X$ of finite size
$N$. When at state $i \in \{1, \ldots, N\}$, there is a transition
probability $p_{ij}$ of getting to the next state $j$. Let $i_k$ be
the state of the system at time $k$. At each time step, the system
is given a reward $\gamma^k r(i_k)$, where $r$ is the instantaneous
reward function and $0 < \gamma < 1$ is a discount factor. The value
at state $i$ is defined as the total expected return:
$v(i) := E[\sum_{k \ge 0} \gamma^k r(i_k) \mid i_0 = i]$. We write
$P$ the $N \times N$ stochastic matrix whose elements are $p_{ij}$.
$v$ can be seen as a vector of $\mathbb{R}^N$. $v$ is known to be
the unique fixed point of the Bellman operator $Tv := r + \gamma P v$,
that is, $v$ solves the Bellman Equation $v = Tv$ and is equal to
$L^{-1} r$ where $L = I - \gamma P$.
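As an illustration of the fixed-point characterization above, here is a minimal sketch that computes the value by iterating the Bellman operator, which contracts because the discount factor is below 1. The function names, the tolerance and the small 2-state chain are illustrative assumptions and do not come from the paper.

```python
def bellman_operator(P, r, gamma, v):
    """Apply T v = r + gamma * P v, with P a row-stochastic list of lists."""
    n = len(r)
    return [r[i] + gamma * sum(P[i][j] * v[j] for j in range(n)) for i in range(n)]


def value_by_iteration(P, r, gamma, tol=1e-12, max_iter=100000):
    """Iterate v <- T v from v = 0 until two successive iterates are tol-close."""
    v = [0.0] * len(r)
    for _ in range(max_iter):
        new_v = bellman_operator(P, r, gamma, v)
        if max(abs(a - b) for a, b in zip(new_v, v)) < tol:
            return new_v
        v = new_v
    return v


# Hypothetical chain: state 1 moves to state 2, and state 2 is absorbing.
P = [[0.0, 1.0], [0.0, 1.0]]
r = [1.0, 2.0]
gamma = 0.9
v = value_by_iteration(P, r, gamma)
print(v)  # close to L^{-1} r, i.e. v(2) = r2 / (1 - gamma) and v(1) = r1 + gamma * v(2)
```

For large N this iteration (or a direct solve of the linear system) is exactly what becomes too costly, which is what motivates the approximation scheme.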
Approximation Scheme When the size $N$ of the state space is
large, one usually comes down to solving the Bellman Equation
approximately. One possibility is to look for an approximate
solution $\hat{v}$ in some specific small space. The simplest and
best understood choice is a linear parameterization:
$\forall i$, $\hat{v}(i) = \sum_{j=1}^{m} w_j \phi_j(i)$, where
$m \ll N$, the $\phi_j$ are some feature functions that should
capture the general shape of $v$, and the $w_j$ are the weights that
characterize the approximate value $\hat{v}$. For all $i$ and $j$,
write $\phi_j$ the $N$-dimensional vector corresponding to the
$j$-th feature function and $\phi(i)$ the $m$-dimensional vector
giving the features of state $i$. For any vector or matrix $X$,
denote $X'$ its transpose. The following $N \times m$ feature matrix
$\Phi = (\phi_1 \ldots \phi_m) = (\phi(1) \ldots \phi(N))'$ leads to
write the parameterization of $\hat{v}$ in a condensed matrix form:
$\hat{v} = \Phi w$, where $w = (w_1, \ldots, w_m)'$ is the
$m$-dimensional weight vector. We will from now on denote
$\mathrm{span}(\Phi)$ this subspace of $\mathbb{R}^N$ and assume
that the vectors $\phi_1, \ldots, \phi_m$ form a linearly
independent set.

Some approximation $\hat{v}$ of $v$ can be obtained by minimizing
$\hat{v} \mapsto \|\hat{v} - v\|$ for some norm $\|\cdot\|$, that
is, equivalently, by projecting $v$ onto $\mathrm{span}(\Phi)$
orthogonally with respect to $\|\cdot\|$. In a very general way,
any symmetric positive definite $N \times N$ matrix $Q$ induces a
quadratic norm $\|\cdot\|_Q$ on $\mathbb{R}^N$ as follows:
$\|v\|_Q = \sqrt{v'Qv}$. It is well known that the orthogonal
projection with respect to such a norm, which we will denote
$\Pi_{\|\cdot\|_Q}$, has the following closed form:
$\Pi_{\|\cdot\|_Q} = \Phi \pi_{\|\cdot\|_Q}$, where
$\pi_{\|\cdot\|_Q} = (\Phi' Q \Phi)^{-1} \Phi' Q$ is the linear
application from $\mathbb{R}^N$ to $\mathbb{R}^m$ that returns the
coordinates of the projection of a point in the basis
$(\phi_1, \ldots, \phi_m)$. With these notations, the following
relations hold: $\pi_{\|\cdot\|_Q} \Phi = I$ and
$\pi_{\|\cdot\|_Q} \Pi_{\|\cdot\|_Q} = \pi_{\|\cdot\|_Q}$.

In an MDP approximation context, where one is modeling a
stochastic system, one usually considers a specific kind of
norm/projection. Let $\xi = (\xi_i)$ be some distribution on $X$
such that $\xi > 0$ (it assigns a positive probability to all
states). Let $\Xi$ be the diagonal matrix with the elements of
$\xi$ on the diagonal. Consider the orthogonal projection of
$\mathbb{R}^N$ onto the feature space $\mathrm{span}(\Phi)$ with
respect to the $\xi$-weighted quadratic norm
$\|v\|_\xi = \sqrt{\sum_{j=1}^{N} \xi_j v_j^2} = \sqrt{v' \Xi v}$.
For clarity of exposition, we will denote this specific projection
$\Pi := \Pi_{\|\cdot\|_\Xi} = \Phi \pi$, where
$\pi := \pi_{\|\cdot\|_\Xi} = (\Phi' \Xi \Phi)^{-1} \Phi' \Xi$.

Ideally, one would like to compute the "best" approximation

$\hat{v}_{best} = \Phi w_{best}$ with $w_{best} = \pi v = \pi L^{-1} r$.

This can be done with algorithms like TD(1) / LSTD(1)
(Bertsekas & Tsitsiklis, 1996; Boyan, 2002), but they require
simulating infinitely long trajectories and usually suffer from a
high variance. The projection methods, which we focus on in this
paper, are alternatives that only consider one-step samples.

TD(0) fix point method The principle of the TD(0) method (TD for
short) is to look for a fixed point of $\Pi T$, that is, one looks
for $\hat{v}_{TD}$ in the space $\mathrm{span}(\Phi)$ satisfying
$\hat{v}_{TD} = \Pi T \hat{v}_{TD}$. Assuming that the matrix
inverse below exists¹, it can be proved² that
$\hat{v}_{TD} = \Phi w_{TD}$ with

$w_{TD} = (\Phi' \Xi L \Phi)^{-1} \Phi' \Xi r$ (1)

As pointed out by Antos et al. (2008); Farahmand et al. (2008);
Sutton et al. (2009), when the inverse exists, the above
computation is equivalent to minimizing for
$\hat{v} \in \mathrm{span}(\Phi)$ the TD error
$E_{TD}(\hat{v}) := \|\hat{v} - \Pi T \hat{v}\|_\xi$ down to 0³.
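The step from the fixed-point equation to Equation (1) can be spelled out as follows. This is only a sketch that uses the definitions given so far ($\Pi = \Phi\pi$, $\pi\Phi = I$, $\pi = (\Phi'\Xi\Phi)^{-1}\Phi'\Xi$ and $L = I - \gamma P$), and it assumes that $\Phi'\Xi L\Phi$ is invertible.

```latex
\begin{align*}
\Phi w_{TD} &= \Pi T \Phi w_{TD} = \Phi \pi (r + \gamma P \Phi w_{TD}) \\
\Rightarrow\quad w_{TD} &= \pi r + \gamma \pi P \Phi w_{TD}
  && \text{(apply $\pi$ and use $\pi \Phi = I$)} \\
\Rightarrow\quad (\Phi' \Xi \Phi)\, w_{TD} &= \Phi' \Xi r + \gamma \Phi' \Xi P \Phi\, w_{TD}
  && \text{(multiply by $\Phi' \Xi \Phi$)} \\
\Rightarrow\quad \Phi' \Xi (I - \gamma P) \Phi \, w_{TD} &= \Phi' \Xi r \\
\Rightarrow\quad w_{TD} &= (\Phi' \Xi L \Phi)^{-1} \Phi' \Xi r .
\end{align*}
```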

BR minimization method The principle of the Bellman Residual (BR)
method is to look for $\hat{v} \in \mathrm{span}(\Phi)$ so that it
minimizes the norm of the Bellman Residual, that is, the quantity
$E_{BR}(\hat{v}) := \|\hat{v} - T\hat{v}\|_\xi$. Since $\hat{v}$ is
of the form $\Phi w$, it can be seen that
$E_{BR}(\hat{v}) = \|\Phi w - \gamma P \Phi w - r\|_\xi = \|\Psi w - r\|_\xi$
using the notation $\Psi = L\Phi$. Using standard linear least
squares arguments, one can see that the minimum BR is obtained for
$\hat{v}_{BR} = \Phi w_{BR}$ with

$w_{BR} := (\Psi' \Xi \Psi)^{-1} \Psi' \Xi r$. (2)

Note that in this case, the above inverse always exists
(Schoknecht, 2002).

2. Two simple examples 

Example 1 Consider the 2-state MDP such that
$P = \begin{pmatrix} 0 & 1 \\ 0 & 1 \end{pmatrix}$. Denote the
rewards $r_1$ and $r_2$. One thus has
$v(1) = r_1 + \frac{\gamma r_2}{1-\gamma}$ and
$v(2) = \frac{r_2}{1-\gamma}$. Consider the one-feature linear
approximation with $\Phi = (1\ 2)'$, with uniform distribution
$\xi = (.5\ .5)'$. One has $\Phi' \Xi \Phi = \frac{5}{2}$, therefore
$\pi = (\frac{1}{5}\ \frac{2}{5})$, and the weight of the best
approximation is
$w_{best} = \pi v = \frac{1}{5} r_1 + \frac{2+\gamma}{5(1-\gamma)} r_2$.
This example has been proposed by Bertsekas & Tsitsiklis (1996) in
order to show that fitted Value Iteration can diverge if the
samples are not generated by the stationary distribution of the
policy. In (Bertsekas & Tsitsiklis, 1996), the authors only
consider the case $r_1 = r_2 = 0$,
¹ This is not necessarily the case, as the forthcoming Example 1
(Section 2) shows.

² Section 4 will generalize this derivation.

³ This remark is also true if we replace $\|\cdot\|_\xi$ by any
equivalent norm $\|\cdot\|$. This observation led Sutton et al.
(2009) to propose original off-policy gradient algorithms
for computing the TD solution.
Figure 1. Error ratio (in log scale) between the TD/BR projection
methods and the best approximation for Example 1, with respect to
the discount factor $\gamma$ and the parameter $\theta$ of the
reward (Left). It turns out that these surfaces do not depend on
$\theta$, so we also draw the graph with respect to $\gamma$ only
(Right).



so that this diverging result was true even though the exact value
function $v(1) = v(2) = 0$ did belong to the feature space. In the
case $r_1 = r_2 = 0$, the TD and BR methods do calculate the exact
solution (we will see later that this is indeed a general fact when
the exact value function belongs to the feature space). We thus
extend this model by taking $(r_1, r_2) \neq (0, 0)$. As a scaling
of the reward is translated exactly in the approximation, we
consider the general form $(r_1, r_2) = (\cos\theta, \sin\theta)$.

Consider the TD solution: one has $\Phi' \Xi = (\frac{1}{2}\ 1)$ and
$(I - \gamma P)\Phi = (1-2\gamma\ \ 2-2\gamma)'$, thus
$\Phi' \Xi L \Phi = \frac{5}{2} - 3\gamma$ and
$\Phi' \Xi r = \frac{r_1}{2} + r_2$. Finally, the weight of the TD
approximation is $w_{TD} = \frac{r_1 + 2 r_2}{5 - 6\gamma}$. One
notices here that the value $\gamma = 5/6$ is singular. Now,
consider the BR solution. One can see that
$(\Psi' \Xi \Psi)^{-1} = \frac{2}{(1-2\gamma)^2 + (2-2\gamma)^2}$
and $\Psi' \Xi r = \frac{(1-2\gamma) r_1 + (2-2\gamma) r_2}{2}$, so
the weight of the BR approximation is

$w_{BR} = \frac{(1-2\gamma) r_1 + (2-2\gamma) r_2}{(1-2\gamma)^2 + (2-2\gamma)^2}$.

For all these approximations, one can compute the squared error
$e$ with respect to the exact value $v$. For any weight
$w \in \{w_{best}, w_{TD}, w_{BR}\}$,
$e(w) = \|v - \Phi w\|_\xi^2 = \frac{1}{2}(v(1) - w)^2 + \frac{1}{2}(v(2) - 2w)^2$.
In Figure 1, we plot the squared error ratios
$\frac{e(w_{TD})}{e(w_{best})}$ and $\frac{e(w_{BR})}{e(w_{best})}$
on a log scale (they are by definition greater than 1) with respect
to $\theta$ and $\gamma$. It turns out that these ratios do not
depend on $\theta$ (instead of showing this through painful
arithmetic manipulations, we will come back to this point and prove
it later on). This Figure also displays the graph with respect to
$\gamma$ only. We can observe that for any choice of reward
function and discount factor, the BR method returns a better value
than the TD method. Also, when $\gamma$ is in the neighborhood of
$\frac{5}{6}$, the TD error ratio tends to $\infty$ while BR's stays
bounded. This Example shows that there exist MDPs where the BR
method is consistently better than the TD method, which can give an
unbounded error. One should however not conclude too quickly that
BR is always better than TD. The literature contains several
arguments in favor of TD, one of which is considered in the
following Example.
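The closed forms of Example 1 are easy to evaluate numerically. The sketch below (function names are illustrative assumptions, not from the paper) computes the squared errors of the best, TD and BR weights for the reward (cos θ, sin θ), and prints the two error ratios for a few discount factors around the singular value 5/6.

```python
import math


def example1_errors(gamma, theta):
    """Squared errors e(w_best), e(w_TD), e(w_BR) for the 2-state Example 1."""
    r1, r2 = math.cos(theta), math.sin(theta)
    # Exact value v = L^{-1} r for P = [[0, 1], [0, 1]].
    v1 = r1 + gamma * r2 / (1.0 - gamma)
    v2 = r2 / (1.0 - gamma)
    # Closed-form weights from the text.
    w_best = r1 / 5.0 + (2.0 + gamma) * r2 / (5.0 * (1.0 - gamma))
    w_td = (r1 + 2.0 * r2) / (5.0 - 6.0 * gamma)   # singular at gamma = 5/6
    a, b = 1.0 - 2.0 * gamma, 2.0 - 2.0 * gamma    # entries of L Phi
    w_br = (a * r1 + b * r2) / (a * a + b * b)

    def sq_err(w):
        # ||v - Phi w||_xi^2 with Phi = (1, 2)' and xi = (0.5, 0.5)
        return 0.5 * (v1 - w) ** 2 + 0.5 * (v2 - 2.0 * w) ** 2

    return sq_err(w_best), sq_err(w_td), sq_err(w_br)


def error_ratios(gamma, theta):
    """Ratios e(w_TD)/e(w_best) and e(w_BR)/e(w_best)."""
    e_best, e_td, e_br = example1_errors(gamma, theta)
    return e_td / e_best, e_br / e_best


for gamma in (0.5, 0.8, 0.83, 0.9):
    td_ratio, br_ratio = error_ratios(gamma, theta=0.3)
    print(f"gamma={gamma}: TD ratio={td_ratio:.3f}, BR ratio={br_ratio:.3f}")
```

Note that for each discount factor there is one reward direction for which the exact value lies in the feature space; the best error is then zero and the ratios are undefined, so such values of θ should be avoided when calling `error_ratios`.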

Example 2 Sutton et al. (2009) recently described a 3-state MDP
example where the TD method computes the best projection while BR
does not. The idea behind this 3-state example can be described in
a quite general way⁴: suppose we have a $(k + l)$-state MDP, of
which the Bellman Equation has a block triangular structure:
$v_1 = \gamma P_1 v_1 + r_1$ and
$v_2 = \gamma P_{21} v_1 + \gamma P_{22} v_2 + r_2$, where
$v_1 \in \mathbb{R}^k$ and $v_2 \in \mathbb{R}^l$ (the concatenation
of the vectors $v_1$ and $v_2$ forms the value function). Suppose
also that the approximation subspace $\mathrm{span}(\Phi)$ is
$\mathbb{R}^k \times S_2$ where $S_2$ is a subspace of
$\mathbb{R}^l$. For the first component $v_1$, the approximation
space is the entire space $\mathbb{R}^k$. With TD, we obtain the
exact value for the first $k$ components of the value, while with
Bellman residual minimization, we do not: satisfying the first
equation exactly is traded for decreasing the error in satisfying
the second one (which also involves $v_1$). In an optimal control
context, the example above can have quite dramatic implications, as
$v_1$ can be related to the costs at some future states accessible
from those states associated with $v_2$, and the future costs are
all that matters when making decisions.

Overall, the two methods generate different types of biases, and
distribute error in different manners. In order to gain some more
insight, we now turn to some analytical facts about them.

3. A Relation and Stability Issues 

Though several works have compared and considered 
both methods (Schoknecht, 2002; Lagoudakis & Parr, 
2003; Munos, 2003; Yu & Bertsekas, 2008), the follow- 
ing simple fact has, to our knowledge, never been em- 
phasized per se: 

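Proposition 1 The BR is an upper bound of the TD error, and more
precisely:

$\forall \hat{v} \in \mathrm{span}(\Phi)$, $E_{BR}(\hat{v})^2 = E_{TD}(\hat{v})^2 + \|T\hat{v} - \Pi T\hat{v}\|_\xi^2$.

Proof This simply follows from the Pythagorean theorem, as
$\Pi T\hat{v} - T\hat{v}$ is orthogonal to $\mathrm{span}(\Phi)$
and $\hat{v} - \Pi T\hat{v}$ belongs to $\mathrm{span}(\Phi)$. ∎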

This implies that if one can make the BR small, then 
the TD Error will also be small. In the limit case where 

4 The rest of this section is strongly inspired by a per- 
sonal communication with Yu. 
one can make the BR equal to 0, then the TD Error
is also 0.
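The decomposition of Proposition 1 can be checked numerically on the 2-state setting of Example 1 (Φ = (1 2)', uniform ξ, both states moving to state 2). The following is only a sanity-check sketch with illustrative function names; the chosen weight, reward and discount factor are arbitrary assumptions.

```python
def squared_xi_norm(x, y):
    """||x - y||_xi^2 for the uniform weights xi = (0.5, 0.5)."""
    return 0.5 * (x[0] - y[0]) ** 2 + 0.5 * (x[1] - y[1]) ** 2


def br_td_decomposition(w, r, gamma):
    """Return (E_BR^2, E_TD^2, adequacy term) for v_hat = Phi w."""
    v_hat = (w, 2.0 * w)                               # Phi = (1, 2)'
    # Both states move to state 2, so (P v_hat)(i) = v_hat(2) for i = 1, 2.
    t_v = (r[0] + gamma * v_hat[1], r[1] + gamma * v_hat[1])
    coeff = t_v[0] / 5.0 + 2.0 * t_v[1] / 5.0          # pi = (1/5, 2/5)
    proj_t_v = (coeff, 2.0 * coeff)                    # Pi T v_hat
    return (squared_xi_norm(v_hat, t_v),
            squared_xi_norm(v_hat, proj_t_v),
            squared_xi_norm(t_v, proj_t_v))


e_br, e_td, adequacy = br_td_decomposition(w=0.7, r=(1.0, -0.4), gamma=0.9)
print(e_br, e_td + adequacy)  # the two numbers agree up to rounding
```

The third returned quantity is the term that TD ignores, so a small TD error says nothing about the Bellman Residual when this term is large.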

One of the motivations for minimizing the BR is historically
related to a well-known result of Williams & Baird (1993):
$\forall \hat{v}$, $\|v - \hat{v}\|_\infty \le \frac{1}{1-\gamma}\|T\hat{v} - \hat{v}\|_\infty$.
Since one considers the weighted quadratic norm in practice⁵, the
related result⁶ that really makes sense here is:
$\forall \hat{v}$, $\|v - \hat{v}\|_\xi \le \frac{\sqrt{C(\xi)}}{1-\gamma}\|T\hat{v} - \hat{v}\|_\xi$,
where $C(\xi) := \max_{i,j} \frac{p_{ij}}{\xi_j}$ is a
"concentration coefficient" that can be seen as some measure of the
stochasticity of the MDP⁷. This result shows that it is sound to
minimize the BR, since it controls (through a constant) the
approximation error $\|v - \hat{v}_{BR}\|_\xi$.

On the TD side, there does not exist any similar result. Actually,
the fact that one can build examples (like Example 1) where the TD
projection is numerically unstable implies that one cannot prove
such a result. Proposition 1 helps to better understand the TD
method: by minimizing the TD Error, one only minimizes one part of
the BR. Equivalently, one does not care about the term
$\|T\hat{v} - \Pi T\hat{v}\|_\xi^2$, which may be interpreted as a
measure of adequacy of the projection $\Pi$ with the Bellman
operator $T$. In Example 1, the approximation error of the TD
projection goes to infinity because this adequacy term diverges.
In (Munos & Szepesvari, 2008), the authors use an algorithm based
on the TD Error and make an assumption on this adequacy term (there
called the inherent Bellman error of the approximation space), so
that their algorithm can be proved convergent.

A complementary view on the potential instability of TD has been
referred to as a norm incompatibility issue (Bertsekas & Tsitsiklis,
1996; Guestrin et al., 2001), and can be revisited through the
notion of concentration coefficient. Stochastic matrices $P$
satisfy $\|P\|_\infty = 1$, which makes the Bellman operator $T$
$\gamma$-contracting, and thus its fixed point is well-defined. The
orthogonal projection with respect to $\|\cdot\|_\xi$ is such that
$\|\Pi\|_\xi = 1$. Thus $P$ and $\Pi$ are of norm 1, but for
different norms. Unfortunately, a general (tight) bound for linear
projections is $\|\Pi\|_\infty \le$
⁵ Mainly because it is computationally easier than doing a
max-norm minimization; see however (Guestrin et al., 2001) for an
attempt at doing max-norm projection.

⁶ The proof is a consequence of Jensen's inequality and the
arguments are very close to the ones in (Munos, 2003).

⁷ If $\xi$ is the uniform law, then there always exists such a
$C(\xi) \in (1, N)$, where one recalls that $N$ is the size of the
state space; in such a case, $C(\xi)$ is minimal if all next-states
are chosen with the uniform law, and maximal as soon as there
exists a deterministic transition. See (Munos, 2003) for more
discussion on this coefficient.



$\frac{1+\sqrt{N}}{2}$ (Thompson, 1996) and it can be shown⁸ that
$\|P\|_\xi \le \sqrt{C(\xi)}$ (which can thus also be of the order
of $\sqrt{N}$). Consequently, $\|\Pi P\|_\infty$ and
$\|\Pi P\|_\xi$ may be greater than 1, and thus the fixed point of
the projected Bellman equation may not be well-defined. A known
exception where the composition $\Pi P$ has norm 1 is when one can
prove that $\|P\|_\xi = 1$ (as for instance when $\xi$ is the
stationary distribution of $P$), and in this case we know from
Bertsekas & Tsitsiklis (1996); Tsitsiklis & Van Roy (1997) that

$\|v - \hat{v}_{TD}\|_\xi \le \frac{1}{\sqrt{1-\gamma^2}} \|v - \hat{v}_{best}\|_\xi$. (3)

Another notable such exception is when $\|\Pi\|_\infty = 1$, as in
the so-called "averager" approximation (Gordon, 1995). However, in
general, the stability of TD is difficult to guarantee.

4. The unified oblique projection view 

In the TD approach, we consider finding the fixed point of the
composition of an orthogonal projection $\Pi$ and the Bellman
operator $T$. Suppose now we consider using a (not necessarily
orthogonal) projection $\Pi$ onto $\mathrm{span}(\Phi)$, that is,
any linear operator that satisfies $\Pi^2 = \Pi$ and whose range is
$\mathrm{span}(\Phi)$. In their most general form, such operators
are called oblique projections and can be written
$\Pi_X = \Phi \pi_X$ with $\pi_X = (X'\Phi)^{-1} X'$, where $X$ is
an $N \times m$ matrix such that $X'\Phi$ is invertible. The
parameter $X$ specifies the projection direction: precisely,
$\Pi_X$ is the projection onto $\mathrm{span}(\Phi)$ orthogonally to
$\mathrm{span}(X)$. As for the orthogonal projections, the following
relations hold: $\pi_X \Phi = I$ and $\pi_X \Pi_X = \pi_X$. Recall
that $L = I - \gamma P$. We are ready to state the main result of
this paper:

Proposition 2 Write $X_{TD} = \Xi\Phi$ and $X_{BR} = \Xi L\Phi$.

(1) The TD fix point computation and the BR minimization are
solutions (respectively with $X = X_{TD}$ and $X = X_{BR}$) of the
projected equation $\hat{v}_X = \Pi_X T \hat{v}_X$. (2) When it
exists, the solution of this projected equation is the projection
of $v$ onto $\mathrm{span}(\Phi)$ orthogonally to
$\mathrm{span}(L'X)$, i.e. formally $\hat{v}_X = \Pi_{L'X} v$.

Proof We begin by showing part (2). Writing
$\hat{v}_X = \Phi w_X$, the fixed point equation is:
$\Phi w_X = \Pi_X (r + \gamma P \Phi w_X)$. Multiplying on both
sides by $\pi_X$, one obtains: $w_X = \pi_X (r + \gamma P \Phi w_X)$
and therefore $w_X = (I - \gamma \pi_X P \Phi)^{-1} \pi_X r$. Using
the definition of $\pi_X$, one
⁸ One can prove that for all $x$,
$\|Px\|_\xi^2 \le \|x\|_{\xi P}^2 \le C(\xi)\|x\|_\xi^2$. The
argument for the first inequality involves Jensen's inequality and
is again close to what is done in (Munos, 2003).
obtains:

$w_X = (I - \gamma (X'\Phi)^{-1} X' P \Phi)^{-1} (X'\Phi)^{-1} X' r$

$= [(X'\Phi)(I - \gamma (X'\Phi)^{-1} X' P \Phi)]^{-1} X' r$

$= (X'(I - \gamma P)\Phi)^{-1} X' r$ (4)

$= (X' L \Phi)^{-1} X' L v$

$= \pi_{L'X} v$

where we finally used $r = Lv$.

The proof of part (1) now follows. The fact that TD is a special
case with $X = \Xi\Phi$ is trivial by construction, since then
$\Pi_X$ is the orthogonal projection with respect to
$\|\cdot\|_\xi$. When $X = \Xi L\Phi$, one simply needs to observe
from Equations 2 and 4 and the definition of $\Psi = L\Phi$ that
$w_X = w_{BR}$. ∎
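Equation 4 gives one formula for both methods. The sketch below evaluates it in the one-feature case (m = 1, so the matrix inverse is a scalar division) on the 2-state chain of Example 1, and checks that the resulting weights satisfy the projected fixed-point equation. Helper names and the numerical values of the reward and discount factor are illustrative assumptions.

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def mat_vec(M, x):
    return [dot(row, x) for row in M]


def oblique_weight(X, L, phi, r):
    """w_X = (X' L Phi)^{-1} X' r for a single feature column phi (m = 1)."""
    return dot(X, r) / dot(X, mat_vec(L, phi))


def projected_update(X, P, phi, r, gamma, w):
    """pi_X (r + gamma P Phi w) with pi_X = (X' Phi)^{-1} X', again for m = 1."""
    p_phi = mat_vec(P, phi)
    target = [r[i] + gamma * p_phi[i] * w for i in range(len(r))]
    return dot(X, target) / dot(X, phi)


gamma = 0.5
P = [[0.0, 1.0], [0.0, 1.0]]
L = [[1.0, -gamma], [0.0, 1.0 - gamma]]      # L = I - gamma P
phi = [1.0, 2.0]
xi = [0.5, 0.5]
r = [1.0, 0.3]

L_phi = mat_vec(L, phi)
X_td = [xi[i] * phi[i] for i in range(2)]     # X_TD = Xi Phi
X_br = [xi[i] * L_phi[i] for i in range(2)]   # X_BR = Xi L Phi

w_td = oblique_weight(X_td, L, phi, r)
w_br = oblique_weight(X_br, L, phi, r)
print(w_td, w_br)
```

Any other choice of `X` with a nonzero `X' L Phi` can be passed to `oblique_weight`, which is the sense in which the view extends beyond TD and BR.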

Beyond its nice and simple geometric flavour, a direct consequence
of Proposition 2 is that it allows us to derive tight error bounds
for TD, BR, and any other method for general $X$. For any square
matrix $M$, write $\sigma(M)$ its spectral radius.

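Proposition 3 For any choice of $X$, the approximation error
satisfies:

$\|v - \hat{v}_X\|_\xi \le \|\Pi_{L'X}\|_\xi \, \|v - \hat{v}_{best}\|_\xi$ (5)

$= \sqrt{\sigma(ABCB')} \, \|v - \hat{v}_{best}\|_\xi$

where $A = \Phi'\Xi\Phi$, $B = (X'L\Phi)^{-1}$ and
$C = X'L\Xi^{-1}L'X$ are matrices of size $m \times m$.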

Thus, for any $X$, the amplification of the smallest error
$\|v - \hat{v}_{best}\|_\xi$ depends on the norm of the associated
oblique projection, which can be estimated as the spectral radius
of the product of small matrices. A simple corollary of this
Proposition is the following: if the real value $v$ belongs to the
feature space $\mathrm{span}(\Phi)$ (in such a case
$v = \hat{v}_{best}$) then all oblique projection methods find it
($\hat{v}_X = v$).

Proof of Proposition 3 Proposition 2 implies that
$v - \hat{v}_X = (I - \Pi_{L'X})v = (I - \Pi_{L'X})(I - \Pi_{\Xi\Phi})v$,
where we used the fact that
$\Pi_{L'X}\Pi_{\Xi\Phi} = \Pi_{\Xi\Phi}$ since $\Pi_{L'X}$ and
$\Pi_{\Xi\Phi}$ (which is the orthogonal projection $\Pi$) are
projections onto $\mathrm{span}(\Phi)$. Taking the norm, one obtains
$\|v - \hat{v}_X\|_\xi \le \|I - \Pi_{L'X}\|_\xi \|v - \Pi_{\Xi\Phi}v\|_\xi = \|\Pi_{L'X}\|_\xi \|v - \hat{v}_{best}\|_\xi$,
where we used the definition of $\hat{v}_{best}$, and the fact that
$\|I - \Pi_{L'X}\|_\xi = \|\Pi_{L'X}\|_\xi$ since $\Pi_{L'X}$ is a
(non-trivial) projection (see e.g. (Szyld, 2006)). Thus Equation 5
holds.

In order to evaluate the norm in terms of small size matrices, one
will use the following Lemma on the projection matrix
$\Pi_{L'X} = \Phi\pi_{L'X}$:



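Lemma 1 (Yu & Bertsekas (2008)) Let $Y$ be an $N \times m$ matrix,
and $Z$ an $m \times N$ matrix, then
$\|YZ\|_\xi^2 = \sigma((Y'\Xi Y)(Z\Xi^{-1}Z'))$.

Thus, $\|\Pi_{L'X}\|_\xi^2 = \|\Phi\pi_{L'X}\|_\xi^2$

$= \sigma[(\Phi'\Xi\Phi)(\pi_{L'X}\Xi^{-1}(\pi_{L'X})')]$

$= \sigma[\Phi'\Xi\Phi (X'L\Phi)^{-1} X'L\Xi^{-1}L'X (\Phi'L'X)^{-1}]$

$= \sigma[ABCB']$. ∎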
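When m = 1 the matrices A, B and C of Proposition 3 are scalars, so the amplification factor can be evaluated by hand or with a few lines of code. The sketch below does this for the 2-state chain of Example 1 with γ = 0.5; the function name and the chosen discount factor are illustrative assumptions.

```python
import math


def amplification(X, L, phi, xi):
    """sqrt(A * B * C * B') of Proposition 3 when m = 1 (all factors are scalars)."""
    n = len(phi)
    # L'X, computed column by column from the N x N matrix L.
    lt_x = [sum(L[i][j] * X[i] for i in range(n)) for j in range(n)]
    A = sum(xi[i] * phi[i] ** 2 for i in range(n))        # Phi' Xi Phi
    B = 1.0 / sum(lt_x[i] * phi[i] for i in range(n))     # (X' L Phi)^{-1}
    C = sum(lt_x[i] ** 2 / xi[i] for i in range(n))       # X' L Xi^{-1} L' X
    return math.sqrt(A * B * C * B)


gamma = 0.5
L = [[1.0, -gamma], [0.0, 1.0 - gamma]]       # L = I - gamma P for P = [[0, 1], [0, 1]]
phi = [1.0, 2.0]
xi = [0.5, 0.5]
L_phi = [sum(L[i][j] * phi[j] for j in range(2)) for i in range(2)]
X_td = [xi[i] * phi[i] for i in range(2)]      # X_TD = Xi Phi
X_br = [xi[i] * L_phi[i] for i in range(2)]    # X_BR = Xi L Phi

bound_td = amplification(X_td, L, phi, xi)
bound_br = amplification(X_br, L, phi, xi)
print(bound_td, bound_br)
```

For γ = 0.5 this gives 1.25 for TD and about 1.118 for BR. In this particular setting the complement of the feature space is one-dimensional, so the bound is attained for every reward for which the best error is nonzero; this would not be expected when N − m > 1.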

Proposition 2 is closely related to the work of (Schoknecht, 2002),
in which the author derived the following characterization of the
TD and BR solutions:

Proposition 4 (Schoknecht (2002)) The TD fix point computation and
the BR minimization are orthogonal projections of the value $v$
respectively induced by the seminorm⁹ $\|\cdot\|_{Q_{TD}}$ with
$Q_{TD} = L'\Xi\Phi\Phi'\Xi L$ and by the norm $\|\cdot\|_{Q_{BR}}$
with $Q_{BR} = L'\Xi L$.

This "orthogonal projection" characterization and our "oblique
projection" characterization are in fact equivalent. On the one
hand for BR, it is immediate to notice that
$\Pi_{\|\cdot\|_{Q_{BR}}} = \Pi_{L'X_{BR}}$. On the other hand for
TD, writing $Y = L'X_{TD}$, one simply needs to notice that
$\Pi_{L'X_{TD}} = \Pi_Y = \Phi(Y'\Phi)^{-1}Y' = \Phi(Y'\Phi)^{-1}(\Phi'Y)^{-1}(\Phi'Y)Y' = \Phi(\Phi'YY'\Phi)^{-1}\Phi'YY' = \Pi_{\|\cdot\|_{Q_{TD}}}$.
The work of Schoknecht (2002) suggests that TD and BR are optimal
for different criteria, since both look for some
$\hat{v} \in \mathrm{span}(\Phi)$ that minimizes $\|\hat{v} - v\|$
for some (semi)norm $\|\cdot\|$. Curiously, our result suggests
that neither is optimal, since neither uses the best projection
direction $X^* := (L')^{-1}\Xi\Phi$, for which
$\hat{v}_{X^*} = \Pi_{L'X^*}v = \Pi_{\Xi\Phi}v = \hat{v}_{best}$,
and this supports the empirical evidence that there is no clear
"winner" between TD and BR.

Our main results, stated in Propositions 2 and 3, constitute a
revisit of the work of Yu & Bertsekas (2008), where the authors
similarly derived error bounds for TD and BR. Our approach mimics
theirs: 1) we derive a linear relation between the projection
$\hat{v}$, the real value $v$ and the best projection
$\hat{v}_{best}$, then 2) we analyze the norm of the matrices
involved in this relation in terms of the spectral radius of small
matrices (through Lemma 1, which is taken from
(Yu & Bertsekas, 2008)). From a purely quantitative point of view,
our bounds are identical to the ones derived there. Two immediate
consequences of this quantitative equivalence are that, as in
(Yu & Bertsekas,
⁹ This is a seminorm because the matrix $Q_{TD}$ is only
semidefinite (since $\Phi\Phi'$ has rank at most $m < N$). The
corresponding projection can still be well defined (i.e. each point
has exactly one projection) provided that
$\mathrm{span}(\Phi) \cap \{x ; \|x\|_{Q_{TD}} = 0\} = \{0\}$.
2008), (1) our bound is tight in the sense that there exists a
worst choice for the reward for which it holds with equality, and
(2) it is always better than that of Equation 3 from
Bertsekas & Tsitsiklis (1996); Tsitsiklis & Van Roy (1997).
However, our work is qualitatively different: by highlighting the
oblique projection relation between $\hat{v}$ and $v$, not only do
we provide a clear geometric intuition for both methods, but we
also greatly simplify the form of the results and their proofs (see
(Yu & Bertsekas, 2008) for details).

Last but not least, there is globally a significant difference
between our work and the two works we have just mentioned. The
analysis we propose is unified for TD and BR (and even extends to
potential new methods through other choices of the parameter $X$),
while the results in (Schoknecht, 2002) and (Yu & Bertsekas, 2008)
are proved independently for each method. We hope that our unified
approach will help to better understand the pros and cons of TD,
BR, and related alternative approaches.



Figure 2. TD win ratio.



5. An Empirical Comparison 

In order to further compare the TD and the BR projections, we
carried out an empirical comparison, which we describe now. We
consider spaces of dimensions $n = 2, 3, \ldots, 30$. For each $n$,
we consider projections of dimensions $k = 1, 2, \ldots, n$. For
each $(n, k)$ couple, we generate 20 random projections (through
random matrices¹⁰ $\Phi$ of size $(n, k)$ and random weight vectors
$\xi$) and 20 random (uncontrolled) chain-like MDPs: from each
state $i$, there is a probability $p_i$ (chosen uniformly at random
in $(0, 1)$) to get to state $i + 1$ and a probability $1 - p_i$ to
stay in $i$ (the last state is absorbing);



10 Each entry is a random uniform number between -1 
and 1. 




Figure 3. Prediction of the best method through Prop. 3

Figure 4. Expectation of $e_{TD}/e_{BR}$.



the reward is a random vector. For the 20 × 20 resulting
combinations, we compute the real value $v$, its exact projection
$\hat{v}_{best}$, the TD fix point $\hat{v}_{TD}$, and the BR
projection $\hat{v}_{BR}$. We then deduce the best error
$e = \|v - \hat{v}_{best}\|_\xi$, the TD error
$e_{TD} = \|v - \hat{v}_{TD}\|_\xi$ and the BR error
$e_{BR} = \|v - \hat{v}_{BR}\|_\xi$. We also compute the bounds of
Proposition 3 for both methods: $b_{TD}$ and $b_{BR}$. Each such
experiment is done for 4 different values of the discount factor
$\gamma$: 0.9, 0.95, 0.99, 0.999.
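The random problems described above could be generated along the following lines. This is only a sketch: the function names are illustrative, and since the text does not say how the weight vectors ξ are drawn, normalizing positive uniform draws is an assumption made here. Note also that Python's `random()` samples from [0, 1) rather than the open interval.

```python
import random


def random_chain(n, rng):
    """Chain-like transition matrix: from i, go to i+1 w.p. p_i, else stay in i."""
    P = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        p = rng.random()          # p_i drawn uniformly in [0, 1)
        P[i][i + 1] = p
        P[i][i] = 1.0 - p
    P[n - 1][n - 1] = 1.0         # the last state is absorbing
    return P


def random_features(n, k, rng):
    """n x k feature matrix with entries uniform between -1 and 1."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(k)] for _ in range(n)]


def random_weights(n, rng):
    """Positive weights summing to 1 (assumed construction, see lead-in)."""
    raw = [rng.random() + 1e-12 for _ in range(n)]
    total = sum(raw)
    return [x / total for x in raw]


rng = random.Random(0)            # fixed seed so that the sketch is reproducible
P = random_chain(5, rng)
Phi = random_features(5, 2, rng)
xi = random_weights(5, rng)
```

The value, the three projections and the bounds would then be computed from `P`, `Phi`, `xi` and a random reward with a linear algebra routine, which is left out here.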

Using this raw data on 20 × 20 problems, we compute for each
$(n, k)$ couple some statistics, which we describe now. All the
graphs that we display show the dimension $n$ of the space and the
dimension $k$ of the projected space on the x-y axes. The z axis
corresponds to the different statistics of interest.

Figure 2 shows the proportion of sampled problems where the TD
method returns a better approximation than BR (i.e. the expectation
of the indicator function of $e_{TD} < e_{BR}$). It turns out that
this ratio is consistently greater than $\frac{1}{2}$, which means
that the TD method is usually better than the BR method. Figure 3
presents the proportion of problems for which the bounds we have
presented in Proposition 3 correctly guess which method is the best
(i.e. the expectation of the indicator function of
$[e_{TD} < e_{BR}] = [b_{TD} < b_{BR}]$). Unless the feature space
dimension is close to the state space dimension, the bounds do not
appear very useful for such a decision. Figure 4 displays the
expectation of $e_{TD}/e_{BR}$. One can observe that, on average,
this expectation is bigger than 1, that is, the BR method tends to
be better, on average, than the TD method. This may look
contradictory with our interpretation of Figure 2, but the
explanation is the following: when the BR method is better than the
TD method, it is by a larger gap than when it is the other way
round. We believe this corresponds to the situation when the TD
method is unstable. Figure 5 confirms this point: it shows the
expectation of the relative approximation errors with respect to
the best possible error, that is, the expectation of $e_{TD}/e$ and
$e_{BR}/e$. One observes on all charts that this average relative
quality of the TD fix point has lots of spikes (corresponding to
numerical instabilities), while that of the BR method is smooth.

6. Conclusion and Future Work 

We have presented the TD fix point and the BR minimization methods
for approximating the value of a fixed policy in an MDP. We have
described two original examples: in the former, the BR method is
consistently better than the TD method, while the latter (which
generalizes the spirit of the example of Sutton et al. (2009)) is
best treated by TD. Proposition 1 highlights the close relation
between the objective criteria that correspond to both methods. It
shows that minimizing the BR implies minimizing the TD error and
some extra "adequacy" term, which happens to be crucial for
numerical stability.

Our main contribution, stated in Proposition 2, provides a new
viewpoint for comparing the two projection methods, and potential
ideas for alternatives. Both TD and BR can be characterized as
solving a projected fixed point equation, and this is to our
knowledge new for BR. Also, the solutions to both methods are some
oblique projection of the value $v$, and this is to our knowledge
new for TD and BR. Finally, this simple geometric characterization
allows us to derive some tight error bounds (Proposition 3). We
have discussed the close relations of our results with those of
Schoknecht (2002) and Yu & Bertsekas (2008), and argued that our
work simplifies and extends them. Though apparently new to the
Reinforcement Learning community, the very idea of oblique
projections of fixed point equations has been studied in the
Numerical Analysis community (see e.g. Saad (2003)). In the
future, we plan to study this literature more carefully, and in
particular to investigate whether it may further contribute to the
MDP context.

Concerning the practical question of choosing between the two
methods TD and BR, the situation can be summarized as follows: the
BR method is sounder than the TD method, since the former has a
performance guarantee while the latter will never have one in
general. Extensive simulations (on random chain-like problems of
size up to 30 states, and for many projections of all possible
space sizes) further suggest the following facts: (a) the TD
solution is more often better than the BR solution; (b) however,
TD sometimes fails dramatically; (c) overall, this makes BR better
on average. Equivalently, one may say that TD is more risky than
BR.

Even if TD is more risky, there remain several reasons why one may
want to use it in practice, and which our study did not focus on.
In large scale problems, one usually estimates the $m \times m$
linear systems through sampling. Sampling-based methods for BR are
more constraining since they generally require double sampling.
Independently, the fact, highlighted by Proposition 1, that the BR
is an upper bound of the TD error, suggests two things. First, we
believe that the variance of the BR problem is higher than that of
the TD problem; thus, given a fixed amount of samples, the TD
solution might be less affected by the corresponding stochastic
noise than the BR one. More generally, the BR problem may be harder
to solve than the TD problem, and from a numerical viewpoint, the
latter may provide better solutions. Finally, we only discussed the
TD(0) fix point method, that is, the specific variant of
TD($\lambda$) (Bertsekas & Tsitsiklis, 1996; Boyan, 2002) where
$\lambda = 0$. Values of $\lambda > 0$ solve some of the weaknesses
of TD(0): it can be shown that the stability issues disappear for
values of $\lambda$ close to 1, and the optimal projection
$\hat{v}_{best}$ is obtained when $\lambda = 1$. Further analytical
and empirical comparisons of TD($\lambda$) with the algorithms we
have considered here (and with some "BR($\lambda$)" algorithm)
constitute future research.

Finally, a somewhat disappointing observation of our study is that
the bounds of Proposition 3, which are the tightest possible bounds
independent of the reward function, did not prove useful for
deciding a priori which of the two methods one should trust more
(recall the results shown in Figure 3). Extending them in a way
that would take the reward into account, as well as trying to
exploit our original unified vision of the bounds (Propositions 2
and 3), are some potential tracks for improvement.